Orthographic Case Restoration Using Supervised Learning Without Manual Annotation

نویسندگان

  • Cheng Niu
  • Wei Li
  • Jihong Ding
  • Rohini K. Srihari
چکیده

One challenge in text processing is the treatment of case insensitive documents such as speech recognition results. The traditional approach is to re-train a language model excluding case-related features. This paper presents an alternative two-step approach whereby a preprocessing module (Step 1) is designed to restore case-sensitive form to feed the core system (Step 2). Step 1 is implemented as a Hidden Markov Model trained on a large raw corpus of case sensitive documents. It is demonstrated that this approach (i) outperforms the feature exclusion approach for Named Entity tagging, (ii) leads to limited degradation for semantic parsing and relationship extraction, (iii) reduces system complexity, and (iv) has wide applicability: the restored text can feed both statistical model and rule-based systems.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Self-Learning to Detect and Segment Cysts in Lung CT Images without Manual Annotation

Image segmentation is a fundamental problem in medical image analysis. In recent years, deep neural networks achieve impressive performances on many medical image segmentation tasks by supervised learning on large manually annotated data. However, expert annotations on big medical datasets are tedious, expensive or sometimes unavailable. Weakly supervised learning could reduce the effort for an...

متن کامل

Superpixel clustering with deep features for unsupervised road segmentation

Vision-based autonomous driving requires classifying each pixel as corresponding to road or not, which can be addressed using semantic segmentation. Semantic segmentation works well when used with a fully supervised model, but in practice, the required work of creating pixel-wise annotations is very expensive. Although weakly supervised segmentation addresses this issue, most methods are not de...

متن کامل

Implicitly-Supervised Learning in Spoken Language Interfaces: An Application to the Confidence Annotation Problem

In this paper we propose the use of a novel learning paradigm in spoken language interfaces – implicitly-supervised learning. The central idea is to extract a supervision signal online, directly from the user, from certain patterns that occur naturally in the conversation. The approach eliminates the need for developer supervision and facilitates online learning and adaptation. As a first step ...

متن کامل

Transfer Learning by Ranking for Weakly Supervised Object Annotation

Object detectors [5] locate objects of interest in images and have many applications including image tagging, consumer photography, and surveillance. Most existing object detectors take a fully supervised learning (FSL) approach, where all the training images are manually annotated with the object location. However, manual annotation of hundreds of object categories is time-consuming, laborious...

متن کامل

A Multi-dimensional Annotation Scheme for Opinion Mining from Unstructured Data

We present a comprehensive annotation scheme for opinion mining of product review data. We motivate our multi-dimensional annotation scheme both from a linguistic perspective and a task based perspective, and contrast it with other existing annotation schemes for opinion mining. A coding manual and annotated corpus of product review data, both of which were prepared using a stringent methodolog...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003